
handle unresponsive nodes #474

Merged
evanharmon merged 16 commits into main from evan/node-responsiveness-updates on Sep 16, 2024
Conversation


@evanharmon evanharmon commented Aug 27, 2024

Motivation

Updates the eip-operator to handle unresponsive egress nodes. A new label on egress nodes indicates their readiness status, new logic disassociates EIPs from unresponsive nodes, and a separate function updates the status label as part of cleaning up unresponsive egress nodes.

Important: The README.md has been updated to reflect this repo being archived in the near future.

Notes for reviewer

Clippy did not like the nested conditional blocks on match statements, so I did some reasonable refactoring to satisfy the warnings. I'm not sure it actually helps readability; what mattered more to me was minimizing unnecessary API calls where possible.

Testing

I've tested both egress nodes going unresponsive and egress nodes being rolled. The changes match the existing behavior, except that egress now moves to the new node when the old one goes unresponsive.

@evanharmon evanharmon self-assigned this Aug 27, 2024
@evanharmon evanharmon requested a review from a team as a code owner August 27, 2024 23:18
@evanharmon evanharmon marked this pull request as draft August 27, 2024 23:19
@evanharmon evanharmon force-pushed the evan/node-responsiveness-updates branch 2 times, most recently from 50f1429 to 2909126 Compare August 28, 2024 19:17
@evanharmon evanharmon force-pushed the evan/node-responsiveness-updates branch from 2909126 to be070fb Compare August 28, 2024 19:31
@evanharmon evanharmon marked this pull request as ready for review August 29, 2024 14:08
@evanharmon evanharmon requested review from pH14 and jubrad August 29, 2024 14:09

@pH14 pH14 left a comment


Ooh very cool to see progress on this issue. My main comment is structural -- I think we should try to remove the global reasoning about nodes from the per-node reconciliation. The least invasive possibility I can think of is to pull the egress gateway labeling into its own control loop, perhaps worth a shot!

eip_operator/src/controller/node.rs (outdated):
// The EIP will be disassociated and then this apply() will exit early since no EIP is associated
warn!("Node {} is in Unknown state, disassociating EIP", &name);
crate::aws::disassociate_eip(&self.ec2_client, &eip_name).await?;
crate::eip::set_status_detached(&eip_api, &eip_name).await?;
Contributor:
I believe this is race-y... many apply fns for different nodes can run concurrently by default, and this set_status operation doesn't include Kube's resourceVersion to perform optimistic locking

as a result, you could have:

  1. Node A has the EIP and is marked Unknown
  2. Node B spins up and the reconciler runs set_status_should_attach to attach it to B
  3. The reconciler for Node A runs set_status_detached and removes it from B
  4. Both reconcilers finish and the EIP is not marked as attached to either node 😱

Contributor:

a Kube replace_status would add in some optimistic locking so the two reconcilers cannot overwrite each other, though I haven't thought much about how hard the error cases would be to reason about
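
For illustration, here is a minimal sketch of that optimistic-locking approach (my addition, not code from this PR), assuming the repo's Eip CRD type with an EipStatus whose resource_id field marks the attachment, as referenced elsewhere in this review. Because replace_status submits metadata.resourceVersion, a concurrent writer bumps the version and this call fails with 409 Conflict instead of silently clobbering the other reconciler's update:

    use kube::api::{Api, PostParams};
    use kube::ResourceExt;

    // Sketch only: `eip` must be the object as read at the start of this
    // reconcile pass, so its resourceVersion reflects the state the
    // decision was based on.
    async fn detach_with_optimistic_lock(
        eip_api: &Api<Eip>,
        eip: &Eip,
    ) -> Result<(), Box<dyn std::error::Error>> {
        let mut updated = eip.clone();
        if let Some(status) = updated.status.as_mut() {
            status.resource_id = None; // release the attachment claim
        }
        // Fails with 409 Conflict if another reconciler (e.g. Node B's
        // set_status_should_attach) wrote first; the caller should requeue
        // and re-read rather than retry blindly.
        eip_api
            .replace_status(
                &updated.name_unchecked(),
                &PostParams::default(),
                serde_json::to_vec(&updated)?,
            )
            .await?;
        Ok(())
    }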

return Ok(None);
}
_ => return Ok(None),
Contributor:

do we not need to do anything for the other (true, true, "Unknown") | (true, false, "Unknown") | (false, true, "Unknown") cases?

Contributor (author):

Worked with Justin on the cases and refactoring of the node controller. I think this is simplified / covered now.

eip_operator/src/controller/node.rs (outdated):
// This code path runs a limited number of times once a node goes unresponsive.
// The EIP will be disassociated and then this apply() will exit early since no EIP is associated
warn!("Node {} is in Unknown state, disassociating EIP", &name);
crate::aws::disassociate_eip(&self.ec2_client, &eip_name).await?;
Contributor:

if for any reason the eip-operator crashed between executing disassociate_eip and set_status_detached, would we be able to reconcile and clear the status correctly?

Contributor:

@pH14 Good thought! We should probably remove the resource-id label first! This would allow the EIP to be reclaimed by another node (or potentially the same node).

We might also want to swap the order in the node controller.
In the node controller, if the node is ready, it should check whether it has already claimed an EIP and ensure it is attached to that EIP. If it is not attached to an EIP, it should attempt to find one, claim it, then associate it and update the other status fields. (This can probably be done in a second PR, but doing it this way may remove the case where we can't fix the EIP's resource id if it got removed while the EIP is still associated.)

Contributor (author):

I like this idea! Leaving for a separate PR as discussed.
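
For concreteness, the reordering discussed in this thread might look roughly like the following sketch; the helper names are taken from snippets elsewhere in this PR, while the parameter types and exact signatures are assumptions:

    // Sketch only: release the Kubernetes-side claim before touching AWS,
    // so a crash between the two steps leaves the EIP reclaimable rather
    // than disassociated-but-still-marked-attached.
    async fn detach_eip_crash_safe(
        ec2_client: &aws_sdk_ec2::Client,
        eip_api: &kube::Api<Eip>,
        eip_name: &str,
    ) -> Result<(), Error> {
        // 1. Clear the claim (resource-id label / status) in Kube state first.
        crate::eip::set_status_detached(eip_api, eip_name).await?;
        // 2. Then disassociate in AWS. If the operator dies before this line,
        //    a later reconcile sees an unclaimed-but-associated EIP and can
        //    finish the cleanup.
        crate::aws::disassociate_eip(ec2_client, eip_name).await?;
        Ok(())
    }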

@pH14 pH14 left a comment

Feels like this is getting close! Left a few more comments, but the structure here feels much easier to grok

let watch_config =
    kube_runtime::watcher::Config::default().labels(EGRESS_GATEWAY_NODE_SELECTOR_LABEL_KEY);

task::spawn(async move {
    let mut watcher = pin!(kube_runtime::watcher(node_api.clone(), watch_config));
Contributor:

hm I'd think we'd want to watch Eip resources rather than Node, since we set the Eip's status each time the Eip is associated or disassociated. it's not clear to me that this fires if we associate an Eip right now

separately, this watcher is also self-triggering, where setting the egress gateway status will trigger another reconciliation. not a correctness issue if it converges, but could create some read/write amplification or hard-to-debug issues if it does anything unexpected

Contributor (author):

Good find! The watcher is now correctly configured on Eip resources.
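
A minimal sketch of that corrected wiring (my addition), assuming the repo's Eip CRD type and the label_egress_nodes helper mentioned later in this review; exact signatures in the repo may differ:

    use std::pin::pin;

    use futures::TryStreamExt;
    use k8s_openapi::api::core::v1::Node;
    use kube::{Api, Client};
    use kube_runtime::watcher;
    use tokio::task;

    fn spawn_egress_labeler(client: Client) {
        let eip_api: Api<Eip> = Api::all(client.clone());
        let node_api: Api<Node> = Api::all(client);
        task::spawn(async move {
            // Watch Eip resources: status writes on associate/disassociate
            // now trigger a relabel pass, which a Node watch would miss.
            let mut stream = pin!(watcher(eip_api.clone(), watcher::Config::default()));
            while let Ok(Some(_event)) = stream.try_next().await {
                if let Err(err) = label_egress_nodes(&eip_api, &node_api).await {
                    tracing::warn!("failed to label egress nodes: {err}");
                }
            }
        });
    }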

@pH14 pH14 left a comment

LGTM! This really feels like you got to the essence of the problem here; the control loops feel very well-scoped, and label_egress_nodes is super clean.

README.md (outdated):
@@ -1,3 +1,4 @@
⚠️⚠️ **WARNING!** ⚠️⚠️ The Materialize K8s-eip-operator will soon be archived and no longer be under active development.
Contributor:

probably enough to say that it "will soon be archived" or "will soon be archived and is no longer under active development".

Contributor (author):

Cleaned up the wording as part of node improvements. I left in the warning portion, since I found that practice used in another archived MZ public repo.

Comment on lines -240 to +241
eip.meta_mut().resource_version = eip_v1.metadata.resource_version.clone();
let resource_version = eip_v1.metadata.resource_version.clone();
eip.meta_mut().resource_version = resource_version;
Contributor:

Why is this changing here?

Contributor (author):

After updating Rust I ran into a clippy error. This change fixes:

assigning the result of Clone::clone() may be inefficient
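
(That message comes from clippy's assigning_clones lint, if I have the name right. Besides binding to a temporary as done here, the fix clippy itself usually suggests is clone_from, which can reuse the destination's existing allocation; a sketch:)

    // Alternative that satisfies the same lint: clone into the existing
    // allocation instead of assigning a fresh clone.
    eip.meta_mut()
        .resource_version
        .clone_from(&eip_v1.metadata.resource_version);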

Comment on lines 129 to 131
// However, unresponsive nodes may still exist with a ready-status egress label
// set to "true". This allows the old node to still serve traffic if possible until a
// new node is ready to take over.
Contributor:

I don't think this agrees with

    // We immediately disassociate EIPs from egress nodes that are in an
    // unresponsive state in the eip node controller.

If the EIP has been disassociated from a node, it should not be serving traffic; we shouldn't wait to remove the status.

I'm not saying we want to do this, but if this is the behavior we're looking for I think we'd have to do something like adding a label to the EIP that says it can be reclaimed (or removing the eip resource id), but not actually detach it when a node becomes unresponsive.

Contributor (author):

Discussed over Slack; covered by the egress labeling and node controller refactors. We agreed to set the EIP's status to detached to free the claim. Egress traffic will continue trying to go out through the unresponsive node, since the EIP will still be associated with the instance.

    eip_api: &Api<Eip>,
    node_api: &Api<Node>,
) -> Result<(), Error> {
    let egress_nodes = get_egress_nodes(&Api::all(node_api.clone().into())).await?;
Contributor:

We probably don't want to get all egress nodes, just the nodes that would match the EIP being reconciled.

I think the following should work.

let egress_nodes = node_api.list(&eip.get_resource_list_params()).await?.items;

This will break if the eip is a pod-filtering eip; we could return early if we see that.

Contributor (author):

I didn't realize this function existed - much cleaner! Covered in egress improvements.
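
A sketch of the combined suggestion; is_pod_eip() is a hypothetical accessor for the pod-filtering case, while get_resource_list_params() is quoted from the comment above:

    // Hypothetical sketch: bail out for pod-filtering Eips, then list only
    // the nodes matching this Eip's selector instead of every egress node.
    if eip.is_pod_eip() {
        return Ok(());
    }
    let egress_nodes = node_api.list(&eip.get_resource_list_params()).await?.items;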

        .into_iter()
        .filter_map(|eip| eip.status.and_then(|s| s.resource_id))
        .collect();

Contributor:

If we want to return early when there isn't a ready attachment, we can return early here when there is no attachment at all. Then I think we can just check whether the attachment for the current eip is ready: if so, we ensure it's marked egress_ready and everything else is marked egress_unready; if it's not ready, we do nothing.

https://github.com/MaterializeInc/k8s-eip-operator/pull/474/files#diff-d5b1eb5e62a747acb5112f073726318730e004f6ce190d0e2f1c11b3868cf8ffR108-R116

Comment on lines 66 to 69
// Note(Evan): find nodes that match eips we're reconciling
// if eip has a resource id, see if the node with the resoruce is ready
// if no, do nothing
// if yes, mark that node as egress_ready=true, and mark all other nodes as egress_ready=false
Contributor:

Suggested change:

    - // Note(Evan): find nodes that match eips we're reconciling
    - // if eip has a resource id, see if the node with the resoruce is ready
    - // if no, do nothing
    - // if yes, mark that node as egress_ready=true, and mark all other nodes as egress_ready=false
    + // If EIP being reconciled has a resource id, find its node, and check if it's ready.
    + // If not ready, return early.
    + // If ready, mark that node as egress_ready=true, and mark all other nodes as egress_ready=false.

Contributor (author):

Makes sense. Covered in egress improvements

Comment on lines 94 to 100
    for other_node in egress_nodes
        .iter()
        .filter(|n| n.name_unchecked() != node.name_unchecked())
    {
        add_gateway_status_label(node_api, other_node.name_unchecked().as_str(), "false")
            .await?;
    }
Contributor:

you already have a for each here, you may as well just

            for other_node in egress_nodes {
                // skip the node we just set to ready
                if other_node.name_unchecked() != node.name_unchecked() {
                    add_gateway_status_label(node_api, other_node.name_unchecked().as_str(), "false")
                        .await?;
                } 
            }

Contributor (author):

I see now I already have the current node name. Adjusted as part of egress improvements

Comment on lines 39 to 53
/// Retrieve all egress nodes in the cluster.
async fn get_egress_nodes(api: &Api<Node>) -> Result<Vec<Node>, kube::Error> {
    let params = ListParams::default().labels(
        format!(
            "{}={}",
            EGRESS_GATEWAY_NODE_SELECTOR_LABEL_KEY, EGRESS_GATEWAY_NODE_SELECTOR_LABEL_VALUE
        )
        .as_str(),
    );

    match api.list(&params).await {
        Ok(node_list) => Ok(node_list.items),
        Err(e) => Err(e),
    }
}
Contributor:

We probably don't need this if we're using the node selector on the eip.

Contributor (author):

Much cleaner, and it removed the need for some const labels. Covered in egress improvements.

// An Unknown ready status could mean the node is unresponsive or experienced a hardware failure.
// A NotReady ready status could mean the node is experiencing a network issue.
_ => {
// Skip detachment if no EIP is associated with this node.
Contributor:

This comment is confusing. Suggested wording:

    If an EIP has a resource id label pointing to this node, remove that label, releasing this node's claim to the EIP

Contributor (author):

I agree. I changed the wording as part of node improvements

Comment on lines 94 to 103
    if let Some(eip) = node_eip {
        warn!(
            "Node {} is in an unresponsive state, detaching EIP {}",
            &name.clone(),
            &eip.name_unchecked()
        );
        crate::eip::set_status_detached(&eip_api, eip).await?;

        return Ok(None);
    }
Contributor:

If we moved this return out of the if and into the match block (with an appropriate comment/debug message) we could avoid having to later check if the node is ready on line 134.

https://github.com/MaterializeInc/k8s-eip-operator/pull/474/files#diff-48a05fa05e5fac935d49f8d16c17290587389f1821d5e5d237c624224763fc7fR134

Contributor (author):

Much cleaner. I was glad to be able to remove the ready_status check from further on in the code. I simplified it in node improvements.
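
Roughly, the hoisted version suggested above might read as follows (a sketch reconstructed from the snippets in this review, not the merged code):

    // Unresponsive (Unknown / NotReady) arm: detach if an EIP points at
    // this node, and always return early, so the claim logic below only
    // runs for ready nodes and the later ready-status check can be dropped.
    _ => {
        if let Some(eip) = node_eip {
            warn!(
                "Node {} is in an unresponsive state, detaching EIP {}",
                &name,
                &eip.name_unchecked()
            );
            crate::eip::set_status_detached(&eip_api, eip).await?;
        }
        return Ok(None);
    }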

@jubrad jubrad left a comment

Approving, but I think you should change dbg! to debug! before merging

}

dbg!("Node {} is not ready, skipping EIP claim", &name);
Contributor:

should this be debug!

Contributor (author):

You're right! Just pushed up a fix.
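
(For context: dbg! is std's expression-debugging macro, which prints each argument to stderr with file/line info and does not apply the format string, while tracing's debug! formats the message and routes it through the subscriber. The pushed fix presumably reads:)

    use tracing::debug;

    debug!("Node {} is not ready, skipping EIP claim", &name);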

@evanharmon evanharmon merged commit 06f16b8 into main Sep 16, 2024
2 checks passed
@evanharmon evanharmon deleted the evan/node-responsiveness-updates branch September 16, 2024 14:34